This analysis explores Airbnb listing data from New York City to understand pricing patterns and market characteristics. Key findings include:
This document focuses on the exploratory data analysis of Airbnb listing data from New York City to understand pricing patterns and market characteristics.
# Load the dataset
bnb = read.csv("/Users/andresperez/Desktop/R Files/bnb_project/data/bnb_listing_rev.csv")
# Display the first few rows
cat("First few rows of the dataset:\n")## First few rows of the dataset:
| id | name | host_id | host_name | host_since | neighbourhood | neighbourhood_group | latitude | longitude | room_type | accommodates | bathrooms_text | bedrooms | beds | price | number_of_reviews | last_review |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 40560656 | The Hoxton, Williamsburg - Cosy Room | 273324213 | The Hoxton Williamsburg | 7/3/2019 | Williamsburg | Brooklyn | 40.72096 | -73.95860 | Hotel room | 2 | NA | NA | 0 | 75 | 11/17/2022 | |
| 41740615 | The James New York - NoMad | 268417148 | The James NoMad | 6/13/2019 | Midtown | Manhattan | 40.74459 | -73.98574 | Hotel room | 4 | NA | NA | 0 | 48 | 10/10/2022 | |
| 41740622 | Gardens Suites Hotel by Affinia | 269311462 | Gardens Suites Hotel By Affinia | 6/17/2019 | Upper East Side | Manhattan | 40.76442 | -73.96303 | Hotel room | 6 | NA | NA | 0 | 5 | 10/1/2022 | |
| 41792753 | Mint House At 70 Pine | 197053492 | Mint House At 70 Pine | 6/21/2018 | Financial District | Manhattan | 40.70651 | -74.00777 | Hotel room | 6 | NA | NA | 0 | 0 | ||
| 42065543 | Broadway Plaza Hotel | 307634016 | Broadway Plaza | 11/6/2019 | Midtown | Manhattan | 40.74444 | -73.98920 | Hotel room | 4 | NA | NA | 0 | 329 | 11/4/2022 | |
| 42065545 | Carvi Hotel NYC | 310429455 | Carvi Hotel | 11/18/2019 | Midtown | Manhattan | 40.75917 | -73.96926 | Hotel room | 4 | NA | NA | 0 | 0 |
# Display basic information about the dataset
cat("\nDataset Dimensions:", dim(bnb)[1], "rows and", dim(bnb)[2], "columns\n")##
## Dataset Dimensions: 40683 rows and 17 columns
# Display column names and their types
str_output = capture.output(str(bnb))
cat("\nColumn Types:\n")##
## Column Types:
## 'data.frame': 40683 obs. of 17 variables:
## $ id : num 40560656 41740615 41740622 41792753 42065543 ...
## $ name : chr "The Hoxton, Williamsburg - Cosy Room" "The James New York - NoMad" "Gardens Suites Hotel by Affinia" "Mint House At 70 Pine" ...
## $ host_id : int 273324213 268417148 269311462 197053492 307634016 310429455 308721299 309714886 307633956 309772430 ...
## $ host_name : chr "The Hoxton Williamsburg" "The James NoMad" "Gardens Suites Hotel By Affinia" "Mint House At 70 Pine" ...
## $ host_since : chr "7/3/2019" "6/13/2019" "6/17/2019" "6/21/2018" ...
## $ neighbourhood : chr "Williamsburg" "Midtown" "Upper East Side" "Financial District" ...
## $ neighbourhood_group: chr "Brooklyn" "Manhattan" "Manhattan" "Manhattan" ...
## $ latitude : num 40.7 40.7 40.8 40.7 40.7 ...
## $ longitude : num -74 -74 -74 -74 -74 ...
## $ room_type : chr "Hotel room" "Hotel room" "Hotel room" "Hotel room" ...
## $ accommodates : int 2 4 6 6 4 4 4 2 4 4 ...
## $ bathrooms_text : chr "" "" "" "" ...
## $ bedrooms : int NA NA NA NA NA NA NA NA NA NA ...
## $ beds : int NA NA NA NA NA NA NA NA NA NA ...
## $ price : int 0 0 0 0 0 0 0 0 0 0 ...
## $ number_of_reviews : int 75 48 5 0 329 0 0 1 0 0 ...
## $ last_review : chr "11/17/2022" "10/10/2022" "10/1/2022" "" ...
# Missing values analysis
missing_analysis = data.frame(
Variable = names(bnb),
Missing_Count = colSums(is.na(bnb)),
Missing_Percent = round(colSums(is.na(bnb)) / nrow(bnb) * 100, 2)
) %>%
arrange(desc(Missing_Percent))
# Display variables with missing values
cat("\nMissing Values Analysis:\n")##
## Missing Values Analysis:
knitr::kable(missing_analysis %>% filter(Missing_Percent > 0),
caption = "Variables with Missing Values")| Variable | Missing_Count | Missing_Percent | |
|---|---|---|---|
| bedrooms | bedrooms | 3781 | 9.29 |
| beds | beds | 920 | 2.26 |
# Check for duplicate rows
duplicate_count = sum(duplicated(bnb))
cat("\nNumber of duplicate rows:", duplicate_count, "\n")##
## Number of duplicate rows: 4
# Check for zero or negative prices
if("price" %in% names(bnb)) {
zero_prices = sum(bnb$price <= 0, na.rm = TRUE)
cat("Number of listings with zero or negative prices:", zero_prices, "\n")
}## Number of listings with zero or negative prices: 30
##
## Unique values in categorical variables:
if("room_type" %in% names(bnb)) {
cat("\nRoom Types:", paste(unique(bnb$room_type), collapse = ", "), "\n")
}##
## Room Types: Hotel room, Entire home/apt, Private room, Shared room
if("neighbourhood_group" %in% names(bnb)) {
cat("Boroughs:", paste(unique(bnb$neighbourhood_group), collapse = ", "), "\n")
}## Boroughs: Brooklyn, Manhattan, Bronx, Queens, Staten Island
# Identify numerical columns for summary
numerical_cols = names(bnb)[sapply(bnb, is.numeric)]
cat("\nNumerical columns found:", paste(numerical_cols, collapse = ", "), "\n")##
## Numerical columns found: id, host_id, latitude, longitude, accommodates, bedrooms, beds, price, number_of_reviews
##
## Summary of numerical variables:
## id host_id latitude longitude
## Min. :2.595e+03 Min. : 2438 Min. :40.50 Min. :-74.25
## 1st Qu.:1.806e+07 1st Qu.: 14722718 1st Qu.:40.69 1st Qu.:-73.98
## Median :4.090e+07 Median : 64484703 Median :40.72 Median :-73.95
## Mean :1.688e+17 Mean :138935351 Mean :40.73 Mean :-73.94
## 3rd Qu.:5.438e+07 3rd Qu.:238989521 3rd Qu.:40.76 3rd Qu.:-73.92
## Max. :7.740e+17 Max. :489996729 Max. :40.91 Max. :-73.71
##
## accommodates bedrooms beds price
## Min. : 0.000 Min. : 1.000 Min. : 1.000 Min. : 0.0
## 1st Qu.: 2.000 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 80.0
## Median : 2.000 Median : 1.000 Median : 1.000 Median :130.0
## Mean : 2.862 Mean : 1.352 Mean : 1.606 Mean :173.5
## 3rd Qu.: 4.000 3rd Qu.: 1.000 3rd Qu.: 2.000 3rd Qu.:210.0
## Max. :16.000 Max. :11.000 Max. :21.000 Max. :999.0
## NA's :3781 NA's :920
## number_of_reviews
## Min. : 0.00
## 1st Qu.: 1.00
## Median : 5.00
## Mean : 26.56
## 3rd Qu.: 25.00
## Max. :1666.00
##
## First few rows of the dataset:
| id | name | host_id | host_name | host_since | neighbourhood | neighbourhood_group | latitude | longitude | room_type | accommodates | bathrooms_text | bedrooms | beds | price | number_of_reviews | last_review |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 40560656 | The Hoxton, Williamsburg - Cosy Room | 273324213 | The Hoxton Williamsburg | 7/3/2019 | Williamsburg | Brooklyn | 40.72096 | -73.95860 | Hotel room | 2 | NA | NA | 0 | 75 | 11/17/2022 | |
| 41740615 | The James New York - NoMad | 268417148 | The James NoMad | 6/13/2019 | Midtown | Manhattan | 40.74459 | -73.98574 | Hotel room | 4 | NA | NA | 0 | 48 | 10/10/2022 | |
| 41740622 | Gardens Suites Hotel by Affinia | 269311462 | Gardens Suites Hotel By Affinia | 6/17/2019 | Upper East Side | Manhattan | 40.76442 | -73.96303 | Hotel room | 6 | NA | NA | 0 | 5 | 10/1/2022 | |
| 41792753 | Mint House At 70 Pine | 197053492 | Mint House At 70 Pine | 6/21/2018 | Financial District | Manhattan | 40.70651 | -74.00777 | Hotel room | 6 | NA | NA | 0 | 0 | ||
| 42065543 | Broadway Plaza Hotel | 307634016 | Broadway Plaza | 11/6/2019 | Midtown | Manhattan | 40.74444 | -73.98920 | Hotel room | 4 | NA | NA | 0 | 329 | 11/4/2022 | |
| 42065545 | Carvi Hotel NYC | 310429455 | Carvi Hotel | 11/18/2019 | Midtown | Manhattan | 40.75917 | -73.96926 | Hotel room | 4 | NA | NA | 0 | 0 |
##
## Dataset Information:
## - Number of rows: 40683
## - Number of columns: 17
##
## Column types:
col_types = sapply(bnb, class)
for(col in names(col_types)) {
cat("- ", col, ": ", col_types[col], "\n")
}## - id : numeric
## - name : character
## - host_id : integer
## - host_name : character
## - host_since : character
## - neighbourhood : character
## - neighbourhood_group : character
## - latitude : numeric
## - longitude : numeric
## - room_type : character
## - accommodates : integer
## - bathrooms_text : character
## - bedrooms : integer
## - beds : integer
## - price : integer
## - number_of_reviews : integer
## - last_review : character
Based on the initial exploration above, we will implement the following cleaning steps:
# Clean the dataset
clean_bnb = bnb %>%
# Remove ID column and duplicates
select(-id) %>%
distinct() %>%
# Remove invalid prices
filter(price > 0)
# Process bathrooms and handle missing shared values
clean_bnb = clean_bnb %>%
mutate(
# Extract bathroom number
bathrooms = as.integer(sapply(strsplit(bathrooms_text, " "), "[", 1)),
# Extract shared status, handling NAs
shared = sapply(strsplit(bathrooms_text, " "), function(x) {
if(length(x) > 1) {
ifelse(x[2] == "shared", 1, 0)
} else {
0 # Default to not shared if information is missing
}
})
)
# Process dates and handle missing values
clean_bnb = clean_bnb %>%
mutate(
# Convert host_since to date
host_since = mdy(host_since),
# For missing dates, use the earliest date in the dataset
host_since = if_else(is.na(host_since),
min(host_since, na.rm = TRUE),
host_since),
# Convert to integer and create year
host_since = as.integer(host_since),
host_since_year = year(as.Date(host_since, origin = "1970-01-01"))
)
# Remove last_review if it exists
if("last_review" %in% names(clean_bnb)) {
clean_bnb = clean_bnb %>% select(-last_review)
}
# Impute missing values for important features
clean_bnb = clean_bnb %>%
group_by(neighbourhood_group, room_type) %>%
mutate(
bedrooms = ifelse(is.na(bedrooms),
median(bedrooms, na.rm = TRUE),
bedrooms),
beds = ifelse(is.na(beds),
median(beds, na.rm = TRUE),
beds),
bathrooms = ifelse(is.na(bathrooms),
median(bathrooms, na.rm = TRUE),
bathrooms)
) %>%
ungroup()
# Handle any remaining NAs with overall medians
clean_bnb = clean_bnb %>%
mutate(
bedrooms = ifelse(is.na(bedrooms), median(bedrooms, na.rm = TRUE), bedrooms),
beds = ifelse(is.na(beds), median(beds, na.rm = TRUE), beds),
bathrooms = ifelse(is.na(bathrooms), median(bathrooms, na.rm = TRUE), bathrooms)
)
# Convert factors
clean_bnb$neighbourhood = as.factor(clean_bnb$neighbourhood)
clean_bnb$neighbourhood_group = as.factor(clean_bnb$neighbourhood_group)
clean_bnb$room_type = as.factor(clean_bnb$room_type)
# Verify cleaning results
cat("\nCleaning Results:\n")##
## Cleaning Results:
## - Initial rows: 40683
## - Rows after cleaning: 40647
## - Duplicates removed: 4
## - Zero/negative prices removed: 30
# Check for any remaining missing values
remaining_nas = colSums(is.na(clean_bnb))
if(sum(remaining_nas) > 0) {
cat("\nRemaining missing values by column:\n")
print(remaining_nas[remaining_nas > 0])
} else {
cat("\nNo missing values remaining in the dataset.\n")
}##
## No missing values remaining in the dataset.
# Check for any remaining duplicates
remaining_duplicates = sum(duplicated(clean_bnb))
if(remaining_duplicates > 0) {
cat("\nWARNING: Still found", remaining_duplicates, "duplicate rows. Removing them now.\n")
clean_bnb = distinct(clean_bnb)
} else {
cat("\nNo duplicate rows remaining.\n")
}##
## WARNING: Still found 1 duplicate rows. Removing them now.
# Replace original dataset with cleaned version
bnb = clean_bnb
# Final verification
cat("\nFinal Dataset Status:\n")##
## Final Dataset Status:
## - Total rows: 40646
## - Total columns: 18
## - Missing values: 0
## - Duplicates: 0
Now that we have properly cleaned data, we can proceed with our exploratory analysis.
# Count of rentals by borough
boroughs = bnb %>%
group_by(neighbourhood_group) %>%
summarize(count=n()) %>%
arrange(-count)
# Visualize borough distribution
ggplot(boroughs, aes(x = reorder(neighbourhood_group, -count), y = count)) +
geom_bar(stat = "identity", fill = "skyblue") +
geom_text(aes(label = count), vjust = -0.5, size = 4) + # Add count labels
theme_minimal() +
labs(title = "Number of Listings by Borough",
x = "Borough",
y = "Number of Listings") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) + # Angle borough names for better readability
scale_y_continuous(expand = expansion(mult = c(0, 0.1))) # Add some space at the top for labels# Price analysis by borough
borough_summary_price = bnb %>%
group_by(neighbourhood_group) %>%
summarize(
min_price = min(price),
max_price = max(price),
average_price = mean(price),
median_price = median(price),
total_listings = n()
) %>%
arrange(-average_price)
# Visualize price distribution
ggplot(bnb, aes(x = neighbourhood_group, y = price)) +
geom_boxplot(fill = "skyblue") +
theme_minimal() +
labs(title = "Price Distribution by Borough",
x = "Borough",
y = "Price") +
coord_flip() +
scale_y_continuous(limits = c(0, quantile(bnb$price, 0.95))) # Remove extreme outliers for better visualization# Room type distribution by borough
ggplot(bnb, aes(x = neighbourhood_group, fill = room_type)) +
geom_bar(position = "fill") +
theme_minimal() +
labs(title = "Room Type Distribution by Borough",
x = "Borough",
y = "Proportion",
fill = "Room Type") +
scale_fill_brewer(palette = "Set2")# Accommodation size analysis
ggplot(bnb, aes(x = accommodates)) +
geom_histogram(binwidth = 1, fill = "skyblue") +
facet_wrap(~neighbourhood_group) +
theme_minimal() +
labs(title = "Distribution of Accommodation Size by Borough",
x = "Number of People Accommodated",
y = "Count")# Bathroom sharing analysis
bathroom_dist = bnb %>%
group_by(neighbourhood_group, shared) %>%
summarize(count = n(), .groups = 'drop') %>%
mutate(shared = ifelse(shared == 1, "Shared", "Private"))
ggplot(bathroom_dist, aes(x = neighbourhood_group, y = count, fill = shared)) +
geom_bar(stat = "identity", position = "fill") +
theme_minimal() +
labs(title = "Bathroom Sharing Distribution by Borough",
x = "Borough",
y = "Proportion",
fill = "Bathroom Type")# Prepare data for mapping
DF.sub = bnb %>%
mutate(
Lat = round(latitude, 4),
Lon = round(longitude, 4),
logprice = log(price)
)
# Create aggregated data for visualization
DF.sub.plot = DF.sub %>%
group_by(neighbourhood, Lon, Lat) %>%
summarize(
Price = mean(price),
LogPrice = mean(logprice),
Listings = n(),
.groups = 'drop'
)# Create color palette for prices
price_pal = colorNumeric(
palette = "YlOrRd",
domain = DF.sub.plot$Price
)
# Create interactive map
leaflet(DF.sub.plot) %>%
addTiles() %>%
addCircleMarkers(
lng = ~Lon,
lat = ~Lat,
radius = ~sqrt(Listings) * 3,
color = ~price_pal(Price),
fillOpacity = 0.7,
popup = ~paste(
"Neighborhood:", neighbourhood,
"<br>Average Price: $", round(Price, 2),
"<br>Number of Listings:", Listings
)
) %>%
addLegend(
position = "bottomright",
pal = price_pal,
values = ~Price,
title = "Average Price ($)",
opacity = 0.7
)# Analyze hosting patterns over time
host_growth = bnb %>%
mutate(year = host_since_year) %>%
group_by(year) %>%
summarize(
new_hosts = n(),
avg_price = mean(price, na.rm = TRUE)
) %>%
filter(!is.na(year))
# Visualize host growth
ggplot(host_growth, aes(x = year)) +
geom_line(aes(y = new_hosts, color = "New Hosts")) +
geom_line(aes(y = avg_price, color = "Average Price")) +
scale_y_continuous(
name = "Number of New Hosts",
sec.axis = sec_axis(~., name = "Average Price ($)")
) +
theme_minimal() +
labs(title = "Growth in Hosts and Prices Over Time",
x = "Year",
color = "Metric") +
theme(legend.position = "bottom")# Analyze price distribution
ggplot(bnb, aes(x = price)) +
geom_histogram(bins = 50, fill = "skyblue") +
theme_minimal() +
labs(title = "Distribution of Listing Prices",
x = "Price ($)",
y = "Count") +
scale_x_continuous(limits = c(0, quantile(bnb$price, 0.95)))# Log-transformed price distribution
ggplot(bnb, aes(x = log(price))) +
geom_histogram(bins = 50, fill = "skyblue") +
theme_minimal() +
labs(title = "Distribution of Log-Transformed Prices",
x = "Log(Price)",
y = "Count")# Price per Accommodation
bnb = bnb %>%
mutate(
price_per_person = price / accommodates,
bed_efficiency = beds / accommodates
)
# Analyze price per person by room type
ggplot(bnb, aes(x = room_type, y = price_per_person)) +
geom_boxplot(fill = "skyblue") +
theme_minimal() +
labs(title = "Price per Person by Room Type",
x = "Room Type",
y = "Price per Person ($)") +
coord_flip() +
scale_y_continuous(limits = c(0, quantile(bnb$price_per_person, 0.95)))# Analyze price by listing age
ggplot(bnb, aes(x = 2024 - host_since_year, y = price)) +
geom_point(alpha = 0.1) +
geom_smooth(method = "loess", color = "red") +
theme_minimal() +
labs(title = "Price vs Listing Age",
x = "Listing Age (Years)",
y = "Price ($)") +
scale_y_continuous(limits = c(0, quantile(bnb$price, 0.95)))# Analyze room capacity utilization
ggplot(bnb, aes(x = bed_efficiency)) +
geom_histogram(bins = 30, fill = "skyblue") +
facet_wrap(~room_type) +
theme_minimal() +
labs(title = "Bed Efficiency by Room Type",
x = "Beds per Person Accommodated",
y = "Count")# Select numerical variables
numerical_vars = bnb %>%
select(price, accommodates, bedrooms, beds, bathrooms,
number_of_reviews, host_since_year,
latitude, longitude)
# Calculate correlation matrix
correlation_matrix = cor(numerical_vars, use = "complete.obs")
# Convert correlation matrix to long format for ggplot
corr_long = melt(correlation_matrix)
# Create correlation plot with ggplot2
ggplot(corr_long, aes(x = Var1, y = Var2, fill = value)) +
geom_tile() +
geom_text(aes(label = sprintf("%.1f", value)),
color = ifelse(abs(corr_long$value) > 0.7, "white", "black")) +
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1)) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
axis.title = element_blank()) +
labs(title = "Correlation Matrix of Numerical Variables",
fill = "Correlation") +
coord_fixed()# Strong correlations analysis (|correlation| > 0.5)
strong_correlations = which(abs(correlation_matrix) > 0.5 & correlation_matrix != 1, arr.ind = TRUE)
strong_corr_df = data.frame(
Variable1 = rownames(correlation_matrix)[strong_correlations[,1]],
Variable2 = colnames(correlation_matrix)[strong_correlations[,2]],
Correlation = correlation_matrix[strong_correlations]
) %>%
arrange(desc(abs(Correlation)))
# Display strong correlations
knitr::kable(strong_corr_df,
caption = "Strong Correlations (|correlation| > 0.5)",
digits = 2)| Variable1 | Variable2 | Correlation |
|---|---|---|
| beds | accommodates | 0.77 |
| accommodates | beds | 0.77 |
| beds | bedrooms | 0.73 |
| bedrooms | beds | 0.73 |
| bedrooms | accommodates | 0.69 |
| accommodates | bedrooms | 0.69 |
# 1. ANOVA: Price differences between boroughs
borough_anova = aov(price ~ neighbourhood_group, data = bnb)
borough_anova_summary = summary(borough_anova)
# 2. Chi-square test: Room type distribution across boroughs
room_borough_table = table(bnb$neighbourhood_group, bnb$room_type)
room_borough_chi = chisq.test(room_borough_table)
# 3. T-test: Price difference between shared and private bathrooms
bathroom_ttest = t.test(price ~ shared, data = bnb)
# 4. Correlation test for price vs number of reviews
review_cor = cor.test(bnb$price, bnb$number_of_reviews)
# 5. Linear regression for price vs accommodates
price_accom_lm = lm(price ~ accommodates, data = bnb)
price_accom_summary = summary(price_accom_lm)
# Display results
cat("Statistical Test Results:\n\n")## Statistical Test Results:
## 1. Borough Price Differences (ANOVA):
## F-statistic: 933.16
## p-value: < 2.22e-16
## 2. Room Type Distribution (Chi-square):
## Chi-square statistic: 1054.97
## p-value: < 2.22e-16
## 3. Bathroom Type Price Difference (T-test):
## t-statistic: 112.29
## p-value: < 2.22e-16
## 4. Price-Reviews Correlation:
## Correlation coefficient: -0.011
## p-value: 0.031109
## 5. Price-Accommodates Relationship:
## R-squared: 0.215
## p-value: < 2.22e-16
These statistical tests provide: - Validation of observed patterns - Quantification of relationships - Evidence-based decision support - Foundation for modeling choices
Borough-Specific Patterns: Each borough shows distinct pricing patterns and market characteristics.
Bathroom Impact: The presence of private vs. shared bathrooms significantly affects pricing across all boroughs.
Important Features:
Geographic Insights:
Temporal Patterns: